Skills: Describing variables
Week 2
This page illustrates some skills that will be useful to you as you learn to describe variables in a dataset.
For the R examples, I’ll use the following packages and load a dataset called my_data.
library(tidyverse)
library(knitr)
library(here)
my_data <- here("datasets",
"example2.csv") |>
read_csv() |>
select(-id)
colnames(my_data) <- c("train_dist",
"park_dist",
"hosp_dist",
"school",
"price")

Descriptive statistics
You may want to describe your data with descriptive statistics.
Mean
The mean is the average value of a variable. It is a measure of central tendency that is sensitive to outliers.
R
The function in R that will calculate a mean is called
mean().
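For example, applied to the price variable (which is what the output below describes):

```r
# Mean of the price variable
mean(my_data$price)
```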
## [1] 1043.235
Excel
In Excel, the function is AVERAGE().
Standard deviation
The standard deviation is a measure of how spread out the data are. It is sensitive to outliers.
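R

In R, the function is sd(). Note that it uses the sample formula (with n − 1 in the denominator). A minimal example with the price variable:

```r
# Sample standard deviation of the price variable
sd(my_data$price)
```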
Excel
In Excel, the function to calculate a standard deviation is
STDEV.P().
Median and other percentiles
A percentile is the value that a specified percentage of observations of a variable fall below. The 50th percentile is also called the median: It’s the value that half of all observations are less than.
The median is a measure of central tendency that is not sensitive to outliers.
The range between lower and upper percentiles can be a useful measure of spread. For example, the interquartile range (IQR) is the range between the 25th and 75th percentiles. Half of all values fall within this range.
R
In R, you can calculate a percentile with the function
quantile(). The median is the 50th percentile.
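For example, for the price variable:

```r
# The median is the 50th percentile
quantile(my_data$price, probs = 0.5)
```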
## 50%
## 1040.074
You can also calculate the 25th and 75th percentiles.
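For example:

```r
# 25th and 75th percentiles of the price variable
quantile(my_data$price, probs = c(0.25, 0.75))
```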
## 25% 75%
## 866.5128 1222.3472
And the difference between them is the interquartile range.
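The built-in IQR() function computes that difference in one step:

```r
# Interquartile range of the price variable
IQR(my_data$price)
```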
## [1] 355.8344
Excel
In Excel, the function to calculate percentiles is
PERCENTILE.EXC().
Proportions
Averages and standard deviations don’t make sense for categorical variables. The appropriate descriptive statistics would be proportions.
R
In R, you can use the table() function to get the number
of observations in each category.
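For example, to count the observations in each school category:

```r
# Number of observations in each school category
table(my_data$school)
```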
##
## East Edgerly Healey Kennedy West
## 5725 327 218 2259 1471
And then you can express these as proportions by dividing those numbers by the number of observations.
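For example:

```r
# Proportion of observations in each school category
table(my_data$school) / nrow(my_data)
```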
##
## East Edgerly Healey Kennedy West
## 0.5725 0.0327 0.0218 0.2259 0.1471
Excel
If you’ve set up a column for each category, where a value of 1 means the observation is in that category and a value of zero means it is not, the average of one of those columns is the proportion of observations within that category.
You can use an IF() function in Excel to convert
categories to zeros and ones.
Confidence intervals
In general, the data sets you work with represent samples of larger populations. Your goal is to be able to generalize from your sample to the population.
Sometimes this is literally and obviously true. For example, you might have sent a survey to one percent of all Canadian households, and you want to analyze the data to make some claims about the characteristics of Canadian households in general.
Other times, describing your sample in terms of the population it comes from is not as obvious. For example, you might have data about every county in the United States suggesting that incomes are higher in urban counties than in rural counties. You might argue that this is not a sample, because you have data on every county that exists. However, if you want to claim that this difference is based on a fundamental difference that is likely to always exist between urban and rural counties in the United States, then you need to be able to generalize this difference to a broader population, which would be all hypothetical counties that could ever exist in the United States.
Confidence intervals for averages
A confidence interval for an average is a range of values that you can have a specified level of confidence that the average value for the full population falls within. A one-sample t-test is a method for calculating such a confidence interval.
R
Here is how you would do a one-sample t-test in R.
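A minimal call, passing the price variable (the confidence interval appears in the output):

```r
# One-sample t-test; the confidence interval is part of the output
t.test(my_data$price)
```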
##
## One Sample t-test
##
## data: my_data$price
## t = 400.24, df = 9999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1038.126 1048.344
## sample estimates:
## mean of x
## 1043.235
By default, the t.test function returns a 95-percent
confidence interval, but you can also specify other confidence levels.
Here is how you would get the 99-percent confidence level.
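The only change is the conf.level argument:

```r
# 99-percent confidence interval
t.test(my_data$price, conf.level = 0.99)
```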
##
## One Sample t-test
##
## data: my_data$price
## t = 400.24, df = 9999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
## 1036.52 1049.95
## sample estimates:
## mean of x
## 1043.235
Excel
To calculate the 95-percent confidence interval in Excel, you’d need to know that the confidence margin is the standard error of the mean (the standard deviation divided by the square root of the number of observations) times the z-score corresponding to the confidence level you want.
So, first you calculate the standard deviation of the sample and divide it by the square root of the sample size to get the standard error.
Then you calculate the z-score for the desired confidence level using
the function NORM.S.INV(). The probability you should use
for a 95% confidence level is 0.975. Or, if you use a probability of
0.025, you’ll get the opposite (i.e. negative) of the z-score you
need.
Now you can calculate the upper and lower bounds of the confidence interval.
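The same arithmetic can be sketched in R, where qnorm() plays the role of Excel’s NORM.S.INV(). (This uses the normal approximation, so it will differ very slightly from the t-based interval that t.test reports.)

```r
# Manual 95% confidence interval for the mean of price
n  <- nrow(my_data)
se <- sd(my_data$price) / sqrt(n)  # standard error of the mean
z  <- qnorm(0.975)                 # z-score for a 95% confidence level
c(lower = mean(my_data$price) - z * se,
  upper = mean(my_data$price) + z * se)
```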
Confidence intervals for proportions
Confidence intervals for proportions work the same way as confidence intervals for averages, but you use the standard error of the proportion instead of the standard error of the mean.
R
In R, you use prop.test rather than t.test when you want to find the confidence interval for a proportion.
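For example, to get a confidence interval for the proportion of observations in the East category:

```r
# Confidence interval for the proportion in the "East" category
prop.test(x = sum(my_data$school == "East"),
          n = nrow(my_data))
```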
##
## 1-sample proportions test with continuity correction
##
## data: sum(my_data$school == "East") out of nrow(my_data), null probability 0.5
## X-squared = 209.96, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.5627276 0.5822164
## sample estimates:
## p
## 0.5725
Excel
To calculate the 95-percent confidence interval in Excel, you’d need to know that the confidence margin is the standard error times the z-score corresponding to the confidence level you want.
The standard error is given by the equation:
\[ S.E. = \sqrt{\frac{p\times(1-p)}{n}} \]
Where p is the proportion and n is the number of observations.
Then you calculate the z-score and the upper and lower bounds of the confidence interval as you would for a continuous variable.
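In R terms, the same calculation would look like this, using the East proportion as an example. (prop.test uses a slightly different method with a continuity correction, so the numbers won’t match its output exactly.)

```r
# Manual 95% confidence interval for the proportion in "East"
n  <- nrow(my_data)
p  <- mean(my_data$school == "East")
se <- sqrt(p * (1 - p) / n)  # standard error of the proportion
z  <- qnorm(0.975)
c(lower = p - z * se,
  upper = p + z * se)
```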
Visualizing data
You might find it useful to visualize a variable in your dataset using a histogram, a one-dimensional scatterplot, or a bar chart.
Histogram
A histogram is a useful way to visualize the distribution of a continuous variable.
R
In R (ggplot2), geom_histogram will generate a
histogram.
Here, I’ll visualize the distribution of a variable in my dataset called “store size.”
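A minimal sketch (I’ll substitute the price variable here for illustration, since the renamed columns of my_data above don’t include a store-size variable):

```r
# Histogram of a continuous variable
ggplot(my_data) +
  geom_histogram(aes(x = price)) +
  theme_linedraw()
```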
Excel
You can create a histogram in Excel using the data analysis tool pack. By default, it will put gaps between the bars. It is more common to show histograms without these gaps, so you can remove them.
One-dimensional scatterplot
A one-dimensional scatterplot can be a much more literal representation of the distribution of a dataset.
R
In R, you can use geom_jitter() to generate a scatterplot.
You can set the x-coordinate as the variable you want to visualize, and
set the y-coordinate as a constant, and R will randomly vary the y
positions.
You might want to turn off the y-axis labels since they’ll be meaningless.
You might also want to make the points a little bit transparent to avoid overplotting.
ggplot(my_data) +
geom_jitter(aes(x = price, y = 0),
alpha = 0.1) +
theme_linedraw() +
theme(axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank())Excel
In Excel, to achieve that same effect, you need to create a fake y-variable that varies randomly.
Then you can create a scatterplot.
Bar charts
You might use a bar chart to represent proportions.
R
In R, you can use geom_bar() to represent proportions
with a stacked bar chart.
ggplot(my_data) +
geom_bar(aes(fill = school, x = 1),
position = "fill") +
scale_y_continuous(name = "Proportion",
breaks = c(0.2,
0.4,
0.6,
0.8,
1),
labels = c("20%",
"40%",
"60%",
"80%",
"100%")) +
theme_linedraw() +
theme(axis.text.x = element_blank(),
axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

breaks <- c(2000, 4000, 6000)

ggplot(my_data) +
  geom_bar(aes(x = school)) +
  scale_y_continuous(breaks = breaks,
                     labels = paste0(100 * breaks / nrow(my_data), "%"),
                     name = "Percentage") +
  theme_linedraw()

You can also calculate the proportions and their confidence intervals and make a little table for them like this.
prop_summary <- my_data |>
mutate(school_east = school == "East",
school_edgerly = school == "Edgerly",
school_healey = school == "Healey",
school_kennedy = school == "Kennedy",
school_west = school == "West") |>
select(school_east,
school_edgerly,
school_healey,
school_kennedy,
school_west) |>
pivot_longer(cols = everything(),
names_to = "which_school",
values_to = "is_school") |>
group_by(which_school) |>
summarise(Proportion = mean(is_school),
`C.I. Low` = prop.test(x = sum(is_school==1),
n = n(),
conf.level = 0.95)$conf.int[1],
`C.I. High` = prop.test(x = sum(is_school==1),
n = n(),
conf.level = 0.95)$conf.int[2])
prop_summary

## # A tibble: 5 × 4
## which_school Proportion `C.I. Low` `C.I. High`
## <chr> <dbl> <dbl> <dbl>
## 1 school_east 0.572 0.563 0.582
## 2 school_edgerly 0.0327 0.0293 0.0364
## 3 school_healey 0.0218 0.0191 0.0249
## 4 school_kennedy 0.226 0.218 0.234
## 5 school_west 0.147 0.140 0.154
And then we can use that to make a bar chart with error bars.
ggplot(prop_summary,
aes(x = which_school,
y = Proportion)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = `C.I. Low`,
ymax = `C.I. High`)) +
theme_linedraw()